A schema is a contract. It says: every piece of knowledge stored in this system will have these exact fields, in these exact formats, every single time.
Think of it like a form. If you go to a hospital, every patient record has the same fields: name, date of birth, blood type, allergies, medications. They don't let one doctor scrawl "Bob, he's 40ish, allergic to something" while another doctor fills out a structured record. The form IS the schema. It enforces consistency.
Why does this matter for AI?
Your sovereign AI stores thousands of pieces of knowledge. Principles, tactics, lessons learned, frameworks. If each one is stored differently — some with a confidence score, some without, some with a source name, some with a hex ID, some with mechanism explanations, some without — you can never reliably search, compare, or transfer knowledge between systems.
Say you extract a principle from a marketing course:
"Lead with outcomes, not mechanisms, when selling to skeptical men." Stored as NODE_SCHEMA:

id: "a7f3b2c1-..." (unique forever)
text: "Lead with outcomes..." (the actual principle)
title: "Outcomes before mechanisms"
node_type: "principle"
source_id: "Anatomy of Ads 2.0" (where it came from)
confidence_score: 0.92 (how validated it is)
tags: "cold_traffic,identity,masculine"
mechanism: "Skeptical men evaluate outcome identity before caring about how-to"
situation: "Cold traffic ads for identity-based offers"
when_not: "Warm retargeting where credibility is established"
Every single principle in the system has these same fields. You can search by confidence. You can filter by tags. You can retrieve by situation. You can compare mechanisms. You can track where it came from. The schema makes the knowledge machine-readable, not just human-readable.
Rob's early system stored knowledge as "holons" — semi-structured blobs with varying fields. Some had sources, some didn't. Some had categories, some were just raw text. When you want to ask "show me all principles about cold traffic with confidence above 0.8" — you can't, because some holons don't have a confidence field, and some don't have category tags.
A schema solves this by requiring every field to exist on every record, even if it's empty. You always CAN query confidence, even if some nodes are at the default 0.75 because they haven't been validated yet.
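A minimal sketch of what that enforcement looks like in practice. The field names follow NODE_SCHEMA as described above; the `normalize` helper and the exact default values (other than the 0.75 confidence baseline, which the text states) are illustrative assumptions.

```python
# Schema enforcement sketch: every record gets every field, with defaults
# filled in, so a query like "confidence_score > 0.8" never hits a missing
# column. The `vector` field is omitted here; it gets added at embedding time.
NODE_SCHEMA_DEFAULTS = {
    "text": "",
    "title": "",
    "node_type": "principle",
    "source_id": "",
    "framework_id": "",
    "confidence_score": 0.75,  # unvalidated baseline, per the text
    "tags": "",
    "mechanism": "",
    "situation": "",
    "when_not": "",
    "collection": "",
    "date_added": "",
}

def normalize(record: dict) -> dict:
    """Return a copy of `record` with every NODE_SCHEMA field present."""
    if "id" not in record:
        raise ValueError("id is required and permanent")
    full = dict(NODE_SCHEMA_DEFAULTS)
    full.update(record)
    return full
```

Every record that passes through `normalize` can be queried on any field, even ones the original extraction never filled in.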
Your system (VOHU MANAH) has 4 schemas because it evolved over time. Each was created for a different purpose.
NODE_SCHEMA: 14 fields, used by 7 collections. This is the rich, fully structured schema for actual knowledge — principles, tactics, examples, book excerpts. Every field is intentional:
| Field | Why it exists |
|---|---|
| id | Unique forever. Never changes. Lets you reference a specific principle across systems. |
| vector | The embedding (we'll explain this in Part 3). Lets you search by meaning, not keywords. |
| text | The actual content. What gets embedded and what humans read. |
| title | Short label for display. "Outcomes before mechanisms" vs the full text. |
| node_type | What kind of knowledge this is. A principle (general truth), a tactic (specific action), a concept (abstract idea), an example (situated instance). |
| source_id | Where it came from. "Anatomy of Ads 2.0" — traceable back to the original material. |
| framework_id | Which cluster of related principles this belongs to. E.g. "cold_traffic" framework. |
| confidence_score | 0.0 to 1.0. How validated this principle is. Starts at extraction quality. Rises when the principle works in the real world. Falls when it doesn't. |
| tags | Searchable labels. "copywriting,cold_traffic,identity" — comma-separated. |
| mechanism | HOW/WHY this works. The causal explanation. Not just "do this" but "this works because..." |
| situation | WHEN to apply this. The context where this principle is valid. |
| when_not | WHEN NOT to apply this. Just as important — prevents misapplication. |
| collection | Which collection it lives in (principles, tactics, etc). |
| date_added | When it was ingested. For tracking recency. |
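With every node carrying the same fields, the query from earlier ("all principles about cold traffic with confidence above 0.8") becomes a plain filter. A toy in-memory sketch; a real system would push this filter down into LanceDB:

```python
# Filter nodes by type, tag, and confidence. Tags are comma-separated
# strings, as in NODE_SCHEMA, so we split before matching.
def query(nodes, node_type, tag, min_confidence):
    return [
        n for n in nodes
        if n["node_type"] == node_type
        and tag in n["tags"].split(",")
        and n["confidence_score"] > min_confidence
    ]

nodes = [
    {"node_type": "principle", "tags": "cold_traffic,identity",
     "confidence_score": 0.92, "title": "Outcomes before mechanisms"},
    {"node_type": "principle", "tags": "warm_retargeting",
     "confidence_score": 0.95, "title": "Lean on credibility"},
    {"node_type": "tactic", "tags": "cold_traffic",
     "confidence_score": 0.90, "title": "Open with the result"},
]
hits = query(nodes, "principle", "cold_traffic", 0.8)
# only "Outcomes before mechanisms" survives all three filters
```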
LEGACY_SCHEMA: 7 fields, used by 4 collections. Created earlier, for simpler document storage (positioning docs, reference sites). It doesn't have mechanism, situation, when_not, framework_id, or even an id field. It's basically text + source + confidence + tags.
CONV_SCHEMA: 9 fields. Stores full chat conversations as JSON blobs. Completely different purpose — this is session history, not knowledge. Not part of codex exchange.
EVERGREEN_SCHEMA: 16 fields. Stores synthesized long-form content (trunk, branches, leaves, threads). This is the OUTPUT of the synthesis pipeline, not atomic knowledge. Not part of codex exchange as-is.
Rob uses a different storage format entirely. His 16,717 holons are stored in LanceDB but with a different schema — less structured than NODE_SCHEMA, more like LEGACY. His holons have: text, vector (768-dim — different embedding model), and varying metadata. No standardized mechanism/situation/when_not fields.
This is the most important concept to understand. Everything else flows from it.
A computer sees "the dog sat on the mat" and "the canine rested on the rug" as completely different strings. Different characters, different lengths. To a computer doing string comparison, these have zero similarity.
But to a human, they mean the same thing.
An embedding is a way to convert meaning into numbers. Specifically, into a list of numbers (a "vector") where similar meanings produce similar numbers.
An embedding model is a neural network that has been trained on billions of text examples. It learned that "dog" and "canine" appear in similar contexts, so they should map to similar numbers. It learned that "the dog sat on the mat" and "investment banking regulations" appear in completely different contexts, so they should map to very different numbers.
When you feed text into an embedding model, it outputs a list of numbers. Like this:
"Lead with outcomes, not mechanisms" → [0.23, -0.15, 0.87, 0.02, -0.41, ...] (1024 numbers)
"Show results before explaining how" → [0.21, -0.14, 0.85, 0.03, -0.39, ...] (1024 numbers)
"How to change a car tire" → [-0.67, 0.33, -0.12, 0.55, 0.08, ...] (1024 numbers)
The first two are about the same concept (outcome-first marketing). Their numbers are almost identical. The third is about something completely different. Its numbers are completely different.
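"Almost identical" has a precise meaning here: cosine similarity, the standard way to compare two embeddings. A runnable sketch using the illustrative 5-number prefixes from the examples above (real BGE-M3 vectors have 1024 numbers, but the math is the same):

```python
import math

# Cosine similarity: the dot product of two vectors, divided by the
# product of their lengths. 1.0 = same direction (same meaning),
# 0 or negative = unrelated.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

outcomes = [0.23, -0.15, 0.87, 0.02, -0.41]   # "Lead with outcomes..."
results  = [0.21, -0.14, 0.85, 0.03, -0.39]   # "Show results..."
tire     = [-0.67, 0.33, -0.12, 0.55, 0.08]   # "How to change a car tire"

cosine(outcomes, results)  # very close to 1.0: same concept
cosine(outcomes, tire)     # negative: unrelated concepts
```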
This is the dimension of the embedding. More dimensions = more nuance. Think of it like describing a color: three numbers (red, green, blue) pin down any screen color, but more numbers could also capture sheen, texture, and how the shade shifts under different light.
768 dimensions (Rob's model) vs 1024 dimensions (your model) means your model captures slightly more nuance. Whether that matters depends on the data.
When you ask "how do I sell to skeptical men?", the system embeds your question into a vector, compares that vector against every stored vector, and returns the nodes whose vectors are closest.
This is semantic search — search by meaning, not keywords. You don't need to use the exact words that are in the stored principle. "How do I sell to skeptical men?" finds "Lead with outcomes, not mechanisms" because the embeddings capture the semantic relationship.
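The retrieval loop can be sketched in a few lines. The `question_vec` below is a stand-in for what the real model (e.g. BGE-M3 via sentence-transformers) would produce; here we reuse the toy vectors so the loop is runnable:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Brute-force semantic search: score every stored vector against the
# query vector, return the k best titles. A vector database (LanceDB)
# does the same thing with an index instead of a full scan.
def top_k(question_vec, store, k=3):
    scored = [(cosine(question_vec, vec), title) for title, vec in store]
    scored.sort(reverse=True)
    return [title for _, title in scored[:k]]

store = [
    ("Outcomes before mechanisms", [0.23, -0.15, 0.87, 0.02, -0.41]),
    ("How to change a car tire",   [-0.67, 0.33, -0.12, 0.55, 0.08]),
]
question_vec = [0.20, -0.10, 0.90, 0.00, -0.40]  # stand-in for embedding the question
best = top_k(question_vec, store, k=1)
```

No keyword from the question appears in the winning principle; only the vector geometry connects them.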
Q's system uses BGE-M3 — produces 1024 numbers per text.
Rob's system uses nomic-embed-text — produces 768 numbers per text.
These are not compatible. You cannot compare a list of 1024 numbers to a list of 768 numbers. It's like trying to compare a 3D object to a 2D photograph of it — they represent the same thing but in different dimensional spaces. The math doesn't work.
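The incompatibility is mechanical: similarity math needs two vectors of equal length. A sketch of the guard any exchange layer would need (the error message wording is an assumption):

```python
import math

def cosine(a, b):
    # Refuse mismatched dimensions up front rather than silently
    # truncating via zip().
    if len(a) != len(b):
        raise ValueError(
            f"incompatible embeddings: {len(a)}-dim vs {len(b)}-dim; "
            "re-embed one side with a shared model"
        )
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

bge_vec = [0.0] * 1024    # Q's BGE-M3 output shape
nomic_vec = [0.0] * 768   # Rob's nomic-embed-text output shape
# cosine(bge_vec, nomic_vec) raises ValueError
```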
This means: Rob's 768-number vectors and your 1024-number vectors can never be compared directly. Any knowledge exchanged between the two systems has to travel as raw text and be re-embedded on the receiving side with that system's own model.
There are hundreds of embedding models. For Meridian, only a handful are realistic because we need: runs locally on CPU (sovereignty), open source (no API dependency), high quality (search must be accurate), and proven at scale.
| Model | Dimensions | Size | Who | Quality (MTEB) | Speed (CPU) | Notes |
|---|---|---|---|---|---|---|
| BAAI/bge-m3 | 1024 | 1.3 GB | Beijing Academy of AI | Very high (top 5 on MTEB retrieval) | ~0.5s per text on CPU | Q's current model. Multilingual. Supports dense + sparse + multi-vector. The most versatile option. |
| nomic-embed-text | 768 | 274 MB | Nomic AI | Good (comparable to OpenAI ada-002) | ~0.2s per text on CPU | Rob's current model. Smaller, faster. Open source. Less nuanced than BGE-M3. |
| BAAI/bge-large-en-v1.5 | 1024 | 1.2 GB | Beijing Academy of AI | High | ~0.4s per text | English-only predecessor to BGE-M3. Slightly worse quality. Same dimensions. |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | 80 MB | Sentence Transformers | Medium | ~0.05s per text | Very fast, very small, but 384-dim means less nuance. Fine for simple search, not enough for Meridian's knowledge density. |
| Cohere embed-v3 | 1024 | API only | Cohere | Very high | Fast (API) | Top quality but requires API — breaks sovereignty. Not viable for air-gapped clients. |
| OpenAI text-embedding-3-large | 3072 | API only | OpenAI | Highest | Fast (API) | Best quality available but API-only + closed source. Non-starter for sovereignty. Also 3072-dim = 3x storage cost. |
| Snowflake/arctic-embed-l | 1024 | 1.1 GB | Snowflake | High | ~0.4s per text | Strong retrieval performance. Open source. 1024-dim. Worth benchmarking against BGE-M3. |
| Alibaba/gte-Qwen2-7B-instruct | 3584 | 14 GB | Alibaba | Near-best | Very slow on CPU | 7B parameter model — runs as a full LLM. Highest quality local option but requires GPU and massive resources. Not practical for client builds. |
Option 1 — standardize on BGE-M3 (1024-dim):

| Pros | Cons |
|---|---|
| Top-tier retrieval quality (top 5 on MTEB); multilingual | 1.3 GB model download |
| Already proven on Q's production data | ~2.5x slower per text than nomic on CPU |

Option 2 — standardize on nomic-embed-text (768-dim):

| Pros | Cons |
|---|---|
| Small (274 MB) and fast (~0.2s per text on CPU) | English-only |
| Already in Rob's stack; no re-embedding on his side | Less nuanced retrieval; 768-dim likely means migrating later |
Here's why, factor by factor:
| Factor | BGE-M3 wins? | Why |
|---|---|---|
| Quality | Yes | Higher MTEB scores. Better retrieval means better agent responses, better codex integration, better synthesis. |
| Multilingual | Yes | Will speaks French, Spanish, Russian, Italian. Clients may have knowledge in multiple languages. nomic is English-only. |
| Scalability | Yes | 1024-dim is becoming the industry standard. Future models will likely output 1024+. Starting at 768 means migrating later. |
| Production validation | Yes | 6,797 nodes, 27,338 edges, 60 evergreen frameworks already proven on BGE-M3. We know it works. |
| Speed | No | nomic is 2.5x faster. But 0.5s vs 0.2s per query is imperceptible to a human. Only matters for bulk ingestion. |
| Size | No | 1.3 GB vs 274 MB. Matters on a Raspberry Pi. Doesn't matter on a machine with 64 GB RAM. |
| Storage | No | 4 KB vs 3 KB per vector. At 10,000 nodes: 40 MB vs 30 MB. Negligible. |
The speed and size advantages of nomic are real but irrelevant at Meridian's scale. The quality and multilingual advantages of BGE-M3 are decisive.
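The storage row above checks out with simple arithmetic, assuming float32 vectors at 4 bytes per dimension:

```python
# Per-vector storage: dimensions x 4 bytes (float32).
bge_bytes = 1024 * 4     # 4096 B, the "4 KB" in the table
nomic_bytes = 768 * 4    # 3072 B, the "3 KB" in the table

# At 10,000 nodes: roughly 41 MB vs 31 MB. Negligible either way.
at_10k = (bge_bytes * 10_000, nomic_bytes * 10_000)
```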
New embedding models come out every few months. If a significantly better model appears in 2027, can we switch?
Theoretically yes, practically it's expensive. The cost is re-embedding everything. For the founding three, manageable (a few hours). For 33 nodes? A weekend project. For 100+? A major migration event.
The mitigation: the schema stores the raw text alongside the vector. You always have the original text. Re-embedding means reading every text field and running it through the new model. Nothing is lost — it's just compute time.
This is why storing the full text (not just the embedding) in NODE_SCHEMA is critical. The text is permanent. The embedding is a function of the text + the model. If the model changes, you regenerate. If the text is gone, you're dead.
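Because the raw text travels with the vector, the whole migration collapses to one pass. A sketch, where `encode` stands in for whatever the replacement model's embedding call is (sentence-transformers models expose an `encode` method):

```python
# Model migration: read every stored text, run it through the new
# embedding model, overwrite the vector. Nothing else in the record
# changes; the text is the source of truth.
def reembed(records, encode):
    for record in records:
        record["vector"] = encode(record["text"])
    return records
```

For a real run you would pass the new model's encoder, e.g. `reembed(all_nodes, new_model.encode)`, then rewrite the LanceDB table.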
Rob's GHOSTNET uses a different stack at every level:
| Component | Rob (GHOSTNET) | Q (VOHU MANAH) | Meridian Base Model |
|---|---|---|---|
| Embedding model | nomic-embed-text (768-dim) | BGE-M3 (1024-dim) | BGE-M3 (1024-dim) |
| Embedding via | Ollama API | sentence-transformers (Python) | sentence-transformers (Python) |
| Storage | LanceDB (memories.lance) | LanceDB (knowledge.db/) | LanceDB (knowledge.db/) |
| Schema | Semi-structured (varying fields) | NODE_SCHEMA (14 fields, enforced) | NODE_SCHEMA v3 (14+ fields, enforced) |
| Collections | 4 (memories, dreams, synthesis, errors) | 13 (7 node + 4 legacy + 2 custom) | TBD — minimum: knowledge + errors + dreams |
| Graph | None (flat holon structure) | SQLite kg_edges (27,338 edges) | SQLite kg_edges |
| Interface | AnythingLLM workspace | Telegram + 4 Dash apps + Copilot | TBD — likely Open WebUI + RAG plugin |
His security hardening, dream engine, and swarm architecture are all above the schema layer. They don't need to change. The schema is the data format. The applications built on top of it are independent.
TIER 1: NODE_SCHEMA (the protocol — exchangeable)
Every principle, tactic, concept, example, error, dream_insight. 14 fields + v3 additions (gravity_score, validation_count, error_count). This is what codex packs contain. This is what gets emitted to the collective. This is the interoperability guarantee. Embedding: BGE-M3, 1024-dim, local CPU.

TIER 2: SYSTEM SCHEMAS (internal — never exchanged)
CONV_SCHEMA — conversation records (session history)
EVERGREEN_SCHEMA — synthesis output pages
SNAPSHOT_SCHEMA — system vital signs over time
AGENT_ACTIVITY — per-agent activity logs (for dreaming)
These never leave the sovereign node. Each client's system tables are their own business.

TIER 3: GRAPH SCHEMA (relationship layer — exchangeable)
kg_edges — connections between NODE_SCHEMA nodes. Edges ARE part of codex packs (they're the knowledge structure). Fields: edge_id, from_id, to_id, rel_type, weight, notes, created_at.

LEGACY_SCHEMA: RETIRE
os_context, reference_sites → migrate to NODE_SCHEMA or mark as system-only (not codex-compatible).
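The Tier 3 edge table can be sketched directly from the fields listed for kg_edges. Column types and the example `rel_type` values are assumptions; the source names only the fields:

```python
import sqlite3

# In-memory SQLite version of the kg_edges table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE kg_edges (
        edge_id    TEXT PRIMARY KEY,
        from_id    TEXT NOT NULL,   -- NODE_SCHEMA id of the source node
        to_id      TEXT NOT NULL,   -- NODE_SCHEMA id of the target node
        rel_type   TEXT NOT NULL,   -- e.g. "supports" (assumed vocabulary)
        weight     REAL DEFAULT 1.0,
        notes      TEXT,
        created_at TEXT
    )
""")
conn.execute(
    "INSERT INTO kg_edges VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("e1", "a7f3b2c1", "b1c2d3e4", "supports", 0.8, None, "2025-01-01"),
)
rows = conn.execute("SELECT from_id, rel_type FROM kg_edges").fetchall()
```

Because edges reference nodes only by their permanent ids, a codex pack can ship nodes and edges together and the graph reassembles on any receiving system.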